In [1]:
import pandas as pd
import gzip
import json
import numpy as np
import matplotlib.pyplot as plt
import networkx as nx
%matplotlib inline
pd.options.mode.chained_assignment = None
Amazon, as the biggest online retailer, is reshaping and recording our lives, and it also acts as an invisible social network: through our behaviour, whether purchasing, reviewing or just viewing, the products on Amazon become connected to each other. In this project, we focus on the Electronics product network.
Specifically, we first generate our network from three products, Macbook, Surface and ThinkPad, which we call the "cores". Starting from each core, we add its also_bought products to the network, forming what we call the "first layer". Then we add the also_bought products of the first layer. In this project, we only add two layers of products, and intuitively each core forms one group of nodes.
The weight of the edge between two products (nodes) is calculated either from the number of shared also_bought products alone, or from the number of shared also_bought products combined with the number of shared reviewers.
After generating the network, we use Spectral Graph Theory, Transductive Learning and Supervised Classification to perform clustering and compare the results of these methods.
The project is a combination of two separate analyses: one is based only on Macbook and Surface, and the other studies all three groups. In each analysis, we weight the edges in the two different ways and try to find out the difference. The details are shown in the index below.
Study on Macbook and Surface
5.1 Use only 'also_bought'
A. Study Basic Properties of Graph
B. Spectral Graph Theory
C. Transductive Learning
5.2 Use both 'also_bought' and 'reviewer'
A. Spectral Graph Theory
B. Supervised Classification
5.3 Network Graphs for Macbook and Surface related products comparison
Study on Macbook, Surface and ThinkPad
6.1 Use only 'also_bought'
A. Spectral Graph Theory
B. Transductive Learning
6.2 Use both 'also_bought' and 'reviewer'
A. Spectral Graph Theory
B. Supervised Classification
Attention: because the original data files are extremely big, we cannot upload them to Moodle, and loading and pre-processing them takes a huge amount of time. If you want to execute the analysis part, you can start directly from Part 4, where we load the small pickle files of the clean, useful data that we have already generated.
In this project, we use the Electronics subset of the Amazon Product Dataset provided by Julian McAuley, UCSD. The dataset includes product metadata (descriptions, category information, price, brand, and image features), links (also viewed / also bought graphs), and millions of reviews (ratings, text, helpfulness votes) spanning May 1996 to July 2014.
The two datasets we use for the project are the metadata of Electronics and the 5-core review data of Electronics.
The 5-core set is the dense subset of the original review dataset in which all users and items have at least 5 reviews.
In the metadata dataset, the format is one product per line in JSON, and the following product attributes are included:
In the review dataset, the format is one review per line in (loose) JSON. A typical review has the following attributes:
In [2]:
# useful functions for data loading
# which are provided by the source of the datasets
def parse(path):
    g = gzip.open(path, 'rb')
    for l in g:
        yield eval(l)

def getDF(path):
    i = 0
    df = {}
    for d in parse(path):
        df[i] = d
        i += 1
    return pd.DataFrame.from_dict(df, orient='index')
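Note that parse uses eval on each line, which executes arbitrary Python. As a hedged alternative (a sketch, assuming the lines are plain Python dict literals as in the loose-JSON dumps), ast.literal_eval only evaluates literals:
import ast
def parse_safe(path):
    # same generator as parse(), but only literal values are evaluated
    with gzip.open(path, 'rb') as g:
        for line in g:
            yield ast.literal_eval(line.decode('utf-8'))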
In [3]:
# load data
df_product = getDF('data/meta_Electronics.json.gz')
df_review = getDF('data/reviews_Electronics_5.json.gz')
# generate pickle files in order to save future loading time
df_product.to_pickle('data/origin_product.pkl')
df_review.to_pickle('data/origin_review.pkl')
Now, let's have a look at our datasets. Below are the structures of the two dataframes:
In [4]:
df_product.head()
Out[4]:
In [5]:
df_review.head()
Out[5]:
In this part, we do some data cleaning and preprocessing work on the electronic product review dataset.
We collect all the reviewer IDs for each product and arrange them in a new dataframe, shown below.
We have chosen the 5-core subset of the review dataset, meaning that all users and items in it have at least 5 reviews. Therefore, for each asin there are at least 5 associated reviewer IDs.
In [6]:
# group by asin(product ID) and save all reviewer's IDs for each asin into a list
df_product_reviewer = pd.DataFrame(df_review.groupby(['asin'])['reviewerID'].apply(list))
df_product_reviewer.to_pickle('data/product_reviewer.pkl')
In [7]:
df_product_reviewer.head()
Out[7]:
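As a quick sanity check (our own aside, not part of the original pipeline), the 5-core property implies that every product should have at least 5 reviewer IDs:
# every product in the 5-core subset should have at least 5 reviewers
min_reviewers = df_product_reviewer['reviewerID'].apply(len).min()
print("Minimum number of reviewers per product: {}".format(min_reviewers))
assert min_reviewers >= 5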
In [8]:
print("There are {} electronic products who have at least 5 reviews in our dataset.".format(df_product_reviewer.shape[0]))
In [9]:
reviewers = set([x for list_ in df_product_reviewer['reviewerID'].values for x in list_])
In [10]:
print("There are {} reviewers who have commented on at least on product in our dataset.".format(len(reviewers)))
In this part, we do some data cleaning and preprocessing work on the electronic products (metadata) dataset.
First of all, we drop the following columns, which are useless for our research: imUrl, salesRank, brand, description, categories, title, price.
Then we want to discover the inner connections among the different electronic products. The column related is essential for us, since it is actually a dictionary holding the also_viewed, also_bought, bought_together and buy_after_viewing product asins. However, for a given product, these 4 attributes do not necessarily all have non-null values.
We also found that there are some NaN values in the related column, so we drop those rows.
In [11]:
# Drop useless columns
df_product.drop(['imUrl','salesRank','brand','description','categories','title','price'],axis=1,inplace=True)
# Drop rows which related column is Nan value
df_product = df_product[df_product['related'].notnull()]
# update product pickle file
df_product.to_pickle('data/product_with_related.pkl')
# Collect all electronic products' asin
product_asin = df_product['asin'].values
Now, we decided to only keep products which exist in both metadata and review datasets.
First of all, we check if all products in the review dataset have the corresponding product information in metadata dataset.
In [12]:
# get all review product asins
product_with_review_asin = df_product_reviewer.index.values
# get all metadata product asins
product_asin = df_product['asin'].values
# check whether every reviewed product also appears in the metadata dataset
if set(product_with_review_asin).issubset(set(product_asin)):
    print("All products which possess reviews have the corresponding product description in the metadata dataset!")
else:
    print("There exist some reviewed products which are not in the metadata dataset.")
As some products are not present in both datasets, we drop them from both datasets and keep only the intersection.
In [13]:
num_product = len(list(set(product_asin) & set(product_with_review_asin)))
print("There are {} electronic products in both metadata and review datasets".format(num_product))
In [14]:
# Clean review dataset and update pickle file
product_to_keep_asin = list(set(product_asin) & set(product_with_review_asin))
df_product_reviewer_new = df_product_reviewer.loc[product_to_keep_asin]
df_product_reviewer_new.to_pickle('data/review_intersection.pkl')
# Clean product dataset and update pickle file
df_product.set_index(['asin'],inplace=True)
df_product = df_product.loc[product_to_keep_asin]
df_product.to_pickle('data/product_intersection.pkl')
At the very beginning, we planned to create a network presenting the relationship between a product and its potential also-bought products. Therefore, we add a new column also_bought which, for each product, lists the also_bought products whose information is also contained in our metadata dataset.
In [15]:
df_product = pd.read_pickle('data/product_intersection.pkl')
product_asin = df_product.index.values
In [16]:
def also_bought(x):
    # verify whether the product has the also_bought attribute
    if 'also_bought' in x.keys():
        temp = []
        for asin in x['also_bought']:
            # verify that the also_bought product is also in the metadata dataset
            if asin in product_asin:
                temp.append(asin)
        return temp
    else:
        return []
In [17]:
df_product['also_bought'] = df_product['related'].apply(lambda x: also_bought(x))
# update pickle file after adding the new column
df_product.to_pickle('data/product_also_bought.pkl')
df_product.head()
Out[17]:
As some products no longer have any also_bought products after the previous processing, we drop them.
In [18]:
df_product_bought = df_product[['also_bought']]
# drop products who don't have also_bought related products
df_product_bought['also_bought'] = df_product_bought['also_bought'].apply(lambda x: 'unknown' if x == [] else x)
df_product_bought = df_product_bought[~(df_product_bought['also_bought'] == 'unknown')]
# update also_bought column since we dropped some product from the dataframe
also_bought_index = df_product_bought.index.values
for asin in also_bought_index:
    new_related = []
    for related in df_product_bought.loc[asin].also_bought:
        # verify if the also_bought product is still in metadata dataset
        if related in also_bought_index:
            new_related.append(related)
    # update the also_bought column value
    df_product_bought.loc[asin].also_bought = new_related
# update pickle file
df_product_bought.to_pickle('data/product_also_bought.pkl')
Now we should update the review dataset as well, since we have just dropped some products from the metadata dataset.
In [19]:
# update review dataset & only keep reviews concerning the products which exist in new product dataset
# collect all product asin
all_product = df_product_bought.index.values
# load original review dataset
df_review = pd.read_pickle('data/review_intersection.pkl')
# keep reviews whose concerning product is in new product list
df_review_bought = df_review.loc[all_product]
# update review dataset
df_review_bought.to_pickle('data/review_also_bought.pkl')
Now we have finished all the data preprocessing work and obtained two useful datasets, stored as two pickle files:
product_also_bought.pkl: contains all the product information and the asins of the also_bought related products.
review_also_bought.pkl: contains all the reviews concerning the products in the first (metadata) dataset.
In [2]:
df_product = pd.read_pickle('data/product_also_bought.pkl')
df_review = pd.read_pickle('data/review_also_bought.pkl')
In [3]:
print("We keep {} electronic products to conduct our project.".format(df_product.shape[0]))
In this part, we build two groups of data, Mac group and Surface group.
In this project, we only add 2 layers of data.
In [4]:
def get_all_related_asin_from_list_asin(list_asin):
    res = []
    for asin in list_asin:
        res.extend(df_product.loc[asin]['also_bought'])
    res = list(set(res))
    return res
'B005CWJB5G' is the asin for Macbook Air.
In [5]:
# Macbook
nb_layer = 2
df_mac = df_product[df_product.index == 'B005CWJB5G']
list_asin = ['B005CWJB5G']
for i in range(nb_layer):
    index = get_all_related_asin_from_list_asin(list_asin)
    df_mac = pd.concat([df_mac, df_product.loc[index]])
    list_asin = index
df_mac = df_mac.reset_index().drop_duplicates(subset='asin')
print('We have {} products in the group of mac.'.format(df_mac.shape[0]))
'B00FG7MVJU' is the asin for Surface.
In [6]:
# Surface
nb_layer = 2
df_surface = df_product[df_product.index == 'B00FG7MVJU']
list_asin = ['B00FG7MVJU']
for i in range(nb_layer):
    index = get_all_related_asin_from_list_asin(list_asin)
    df_surface = pd.concat([df_surface, df_product.loc[index]])
    list_asin = index
df_surface = df_surface.reset_index().drop_duplicates(subset='asin')
print('We have {} products in the group of surface.'.format(df_surface.shape[0]))
We note that some products appear in both groups.
In [7]:
dupasin = list(set(df_mac.asin.values) & set(df_surface.asin.values))
print('We have {} duplicate products in two groups.'.format(len(dupasin)))
To balance the two groups of data, we treat all duplicates as belonging to the Surface group.
Thus, we drop all of the duplicates from the Mac group.
In [8]:
df_mac_new = df_mac.set_index('asin').drop(dupasin)
n_mac = df_mac_new.shape[0]
print('Now, we have {} products in the group of mac.'.format(n_mac))
Merge the two groups of data into one DataFrame.
We now know that the first n_mac products are in the Mac group, while the others are in the Surface group.
In [9]:
df_all = pd.concat([df_mac_new.reset_index(), df_surface])
print('We have {} products in all.'.format(df_all.shape[0]))
Generate the labels for the two groups: 0 for Mac and 1 for Surface.
In [10]:
label_mac_surface = np.ones(df_all.shape[0])
label_mac_surface[0:n_mac] = 0
Have a look at our downsampled dataframe now.
In [11]:
df_all.head()
Out[11]:
In this part, we use the data selected above to build the product graph.
Weight of edge: we implement two ways to calculate the weight of an edge.
Use only the information of 'also_bought': $$Weight = \# Shared\_Neighbors$$
Use both the information of 'also_bought' and 'reviewer': $$Weight = 0.2 * \frac{\# Shared\_Neighbors}{Average\_Shared\_Neighbors} + 0.8 * \frac{\# Shared\_Reviewers}{Average\_Shared\_Reviewers}$$ (a small numeric illustration of the second formula follows below).
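To make the combined formula concrete, here is a toy illustration with made-up numbers (the real averages are computed from the data later):
# toy illustration of the combined weight; all numbers are made up
shared_neighbors = 4        # also_bought products shared by the two nodes
shared_reviewers = 2        # reviewers shared by the two nodes
avg_shared_neighbors = 8.0  # mean over pairs with a non-zero count (illustrative)
avg_shared_reviewers = 1.5  # mean over pairs with a non-zero count (illustrative)
weight = 0.2*shared_neighbors/avg_shared_neighbors + 0.8*shared_reviewers/avg_shared_reviewers
print(weight)  # 0.2*0.5 + 0.8*1.33... ≈ 1.17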
In [12]:
G = nx.Graph()
G.add_nodes_from(df_all.asin.values)
In [13]:
for i in range(len(df_all)):
    for j in range(i+1, len(df_all)):
        # calculate the number of shared "also_bought" products
        set_len = len(set(df_all.iloc[i]['also_bought']) & set(df_all.iloc[j]['also_bought']))
        # connect the two products and give weight to the edge
        G.add_edge(df_all.iloc[i]['asin'], df_all.iloc[j]['asin'], weight=set_len)
In [14]:
M_mac_surface = nx.adjacency_matrix(G, weight='weight')
M_mac_surface.todense()
Out[14]:
In [15]:
node_number = len(G.nodes())
print('The network has {} nodes.'.format(node_number))
edge_number = len(np.nonzero(M_mac_surface)[0])/2
print('The network has {:.0f} edges.'.format(edge_number))
In [16]:
max_edge = node_number*(node_number-1)/2
print ("The maximum number of edges of our network is {:.0f}.".format(max_edge))
rate = 1.0*edge_number/max_edge
print ("The rate of the real-edge/max-number is {:.2%}.".format(rate))
The maximum number of links in a network with $N$ nodes is $L_{max}=N(N-1)/2$. Now we have about 50% of them. This graph is not very sparse.
Degree distribution
In [17]:
d = M_mac_surface.todense().sum(axis=1)
plt.xlabel("Weighted Degree")
plt.ylabel("# Node")
plt.title("Degree Distribution")
plt.hist(d, bins=20);
The average weighted degree is computed below.
In [18]:
d_avg = np.mean(d)
print("The average degree is {:.2f}.".format(d_avg))
Graph generative models
Now we generate Erdős–Rényi and Barabási–Albert graphs using NetworkX and compare them with our graph.
In [19]:
er = nx.erdos_renyi_graph(node_number, rate)
ba = nx.barabasi_albert_graph(node_number, int(edge_number/node_number))
In [20]:
print('My Erdős–Rényi network has {} nodes.'.format(len(er.nodes())))
print('My Erdős–Rényi network has {} edges.'.format(er.size()))
print('My Barabási-Albert network has {} nodes.'.format(len(ba.nodes())))
print('My Barabási-Albert network has {} edges.'.format(ba.size()))
From these generative models, we can see that the edge count of our network is closer to that of the Erdős–Rényi model, which is a random graph.
In [21]:
giant_G = max(nx.connected_component_subgraphs(G), key=len)
giant_er = max(nx.connected_component_subgraphs(er), key=len)
giant_ba = max(nx.connected_component_subgraphs(ba), key=len)
print(len(giant_G.nodes()))
print(len(giant_er.nodes()))
print(len(giant_ba.nodes()))
All of the graphs are fully connected.
In [22]:
print (nx.average_clustering(G))
print (nx.average_clustering(er))
print (nx.average_clustering(ba))
From the way we construct the graph, we can easily see that it is fully connected, and we have now verified that. We can also see that the generative models cannot mimic the clustering coefficient of the real graph.
Calculate the similarity matrix by normalizing so that the maximum weight is 1.
In [23]:
M_mac_surface.max()
Out[23]:
In [24]:
S_mac_surface = M_mac_surface.todense()/M_mac_surface.max()
np.fill_diagonal(S_mac_surface ,0)
Keep the TOP K Neighbors.
We first use K = 300.
In [25]:
W_mac_surface = S_mac_surface.getA().copy()
NEIGHBORS = 300
# Keep only the K strongest edges
# By making weights of all other edges equal to 0
ind = np.argsort(W_mac_surface)[:,::-1][:,NEIGHBORS:]
for i in range(W_mac_surface.shape[1]):
    W_mac_surface[i][ind[i]] = 0
# Make it symmetric
bigger = W_mac_surface.transpose() > W_mac_surface
W_mac_surface = W_mac_surface - W_mac_surface*bigger + W_mac_surface.transpose()*bigger
In [26]:
W_mac_surface
Out[26]:
Calculate the normalized Laplacian $L = D^{-1/2}(D - W)D^{-1/2}$.
In [27]:
degrees = np.sum(W_mac_surface, axis=0)
D = np.diag(degrees)
D_0_5 = np.diag(1./np.sqrt(degrees))
laplacian = np.dot(D_0_5, np.dot((D - W_mac_surface), D_0_5))
plt.spy(laplacian)
plt.show()
Eigenvalue decomposition of the Laplacian.
In [28]:
from scipy import sparse, stats, spatial
laplacian = sparse.csr_matrix(laplacian)
eigenvalues, eigenvectors = sparse.linalg.eigsh(laplacian, k=10, which='SM')
plt.plot(eigenvalues, '.-', markersize=15);
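As an aside, the multiplicity of the zero eigenvalue of the normalized Laplacian equals the number of connected components, so counting near-zero eigenvalues is a quick check of how strongly the graph splits apart:
# number of (numerically) zero eigenvalues = number of connected components
print("Near-zero eigenvalues:", np.sum(eigenvalues < 1e-8))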
Plot the ground truth and the prediction based on spectral graph theory.
The ground truth is label_mac_surface, in which we label the Mac group 0 and the Surface group 1.
In [32]:
x1 = eigenvectors[:,1]
y1 = eigenvectors[:,2]
fig, ax = plt.subplots(2,1,figsize=(20,10))
ax[0].scatter(x1, y1, c=label_mac_surface, cmap='RdBu', alpha=0.5)
ax[0].set_title('Ground Truth')
ax[0].grid()
ax[1].scatter(x1, y1, c=(x1<0), cmap='RdBu', alpha=0.5)
ax[1].set_title('Spectral Graph Theory')
ax[1].grid()
Error rate
In [33]:
error_rate = np.sum((x1<0) != label_mac_surface)/len(label_mac_surface)
print("The error rate of Spectral Graph Theory is {:.2%}".format(error_rate))
In [34]:
# Ground truth.
x = np.ones(len(G.nodes))
x[0:n_mac] = -1
def prepare_observations(p):
    """Prepare observations, where p is the percentage of values to keep."""
    rs = np.random.RandomState(42)
    M = np.diag(rs.uniform(size=len(G.nodes)) < p)
    return M.dot(x)
# Play with the percentage of observed values.
y = prepare_observations(p=0.1)
In [35]:
def solve(y, alpha):
    """
    Solve the optimization problem.
    Parameters:
        y: the observations
        alpha: the balance between fidelity and the smoothness prior.
    Returns:
        x_pred: the predicted class
        x_star: the solution of the optimization problem
    """
    M = np.diag(y != 0)
    x_star = np.linalg.solve(M + alpha*laplacian, M.dot(y))
    x_pred = np.ones(len(G.nodes))
    x_pred[x_star < 0] = -1
    return x_pred, x_star
# Try with default setting
x_pred, x_star = solve(y, alpha=1)
# Be sure that the prediction is binary.
np.testing.assert_equal(abs(x_pred), 1)
# Error rate.
err = np.sum(x != x_pred)
print('{} errors ({:.2%})'.format(err, err/len(x)))
In [36]:
# Parameter sweep over the percentage of observed labels
alpha = 0.1
for p in [0.1, 0.3, 0.5, 0.7, 0.9]:
    y = prepare_observations(p)
    x_pred, _ = solve(y, alpha)
    err = np.sum(x != x_pred)
    print('p = {}: {:6.2%}'.format(p, err/len(x)))
The results of transductive learning are quite good, since it is semi-supervised: with only 30% of the labels observed, we already get an error rate below 10%. That is very useful for real-life problems, because we often do not know the labels of all the points in our dataset.
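One could also sweep alpha together with p to see how sensitive the result is to the smoothness prior; a hedged extension of the sweep above:
# joint sweep over the observation ratio p and the regularization alpha
for alpha in [0.01, 0.1, 1, 10]:
    for p in [0.1, 0.3, 0.5]:
        y = prepare_observations(p)
        x_pred, _ = solve(y, alpha)
        err = np.sum(x != x_pred)
        print('alpha = {}, p = {}: {:6.2%}'.format(alpha, p, err/len(x)))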
Plot
In [37]:
x1 = eigenvectors[:,1]
y1 = eigenvectors[:,2]
fig, ax = plt.subplots(2,1,figsize=(20,10))
ax[0].scatter(x1, y1, c=label_mac_surface, cmap='RdBu', alpha=0.5)
ax[0].set_title('Ground Truth')
ax[0].grid()
ax[1].scatter(x1, y1, c=x_pred, cmap='RdBu', alpha=0.5)
ax[1].set_title('Transductive Learning')
ax[1].grid()
In [38]:
df_review.head()
Out[38]:
Select the review information for the products in our DataFrame.
In [39]:
df_review_chosen = df_review.loc[df_all.asin].reset_index()
Build the graph as we did above.
In [40]:
GB = nx.Graph()
GB.add_nodes_from(df_all.asin.values)
In [41]:
b_len_list = [] # list for # shared also_bought
r_len_list = [] # list for # shared reviewer
for i in range(len(df_all)):
    for j in range(i+1, len(df_all)):
        b_len_list.append(len(set(df_all.iloc[i]['also_bought']) & set(df_all.iloc[j]['also_bought'])))
        r_len_list.append(len(set(df_review_chosen.iloc[i]['reviewerID']) & set(df_review_chosen.iloc[j]['reviewerID'])))
In [42]:
# Study the basics of two lists
df_b_r = pd.DataFrame({'b': b_len_list, 'r': r_len_list})
# Only show the non-zero value distribution
pd.DataFrame([df_b_r[df_b_r.b>0].describe().b, df_b_r[df_b_r.r>0].describe().r])
Out[42]:
As the table above shows, the distributions of the number of shared also_bought products and of shared reviewers are quite skewed. The mean is only slightly below the third quartile, which means there are just a few extremely large values in each list. In order to add the two numbers together, we normalize each of them by the mean of its non-zero elements.
In [43]:
b_len_list.reverse()
r_len_list.reverse()
mean_b = np.mean(df_b_r[df_b_r.b>0].b)
mean_r = np.mean(df_b_r[df_b_r.r>0].r)
for i in range(len(df_all)):
    for j in range(i+1, len(df_all)):
        b_len = b_len_list.pop()/mean_b
        r_len = r_len_list.pop()/mean_r
        # The weight is manually chosen now and may need to be modified
        GB.add_edge(df_all.iloc[i]['asin'], df_all.iloc[j]['asin'], weight=0.2*b_len+0.8*r_len)
In [44]:
M_mac_surface_2 = nx.adjacency_matrix(GB,weight='weight')
M_mac_surface_2.todense()
Out[44]:
Similarity Matrix
In [56]:
M_mac_surface_2.max()
Out[56]:
In [57]:
S_mac_surface_2 = M_mac_surface_2.todense()/M_mac_surface_2.max()
np.fill_diagonal(S_mac_surface_2,0)
TOP K Neighbors
In [58]:
W_mac_surface_2 = S_mac_surface_2.getA().copy()
NEIGHBORS = 300
# Keep only the K strongest edges
# By making weights of all other edges equal to 0
ind = np.argsort(W_mac_surface_2)[:,::-1][:,NEIGHBORS:]
for i in range(W_mac_surface_2.shape[1]):
    W_mac_surface_2[i][ind[i]] = 0
# Make it symmetric
bigger = W_mac_surface_2.transpose() > W_mac_surface_2
W_mac_surface_2 = W_mac_surface_2 - W_mac_surface_2*bigger + W_mac_surface_2.transpose()*bigger
Calculate Laplacian
In [59]:
degrees = np.sum(W_mac_surface_2, axis=0)
D = np.diag(degrees)
D_0_5 = np.diag(1./np.sqrt(degrees))
laplacian = np.dot(D_0_5, np.dot((D - W_mac_surface_2), D_0_5))
plt.spy(laplacian)
plt.show()
Eigenvalue Decomposition
In [60]:
from scipy import sparse, stats, spatial
laplacian = sparse.csr_matrix(laplacian)
eigenvalues, eigenvectors = sparse.linalg.eigsh(laplacian, k=10, which='SM')
plt.plot(eigenvalues, '.-', markersize=15);
Plot
In [62]:
x2 = eigenvectors[:,1]
y2 = eigenvectors[:,2]
fig, ax = plt.subplots(2,1,figsize=(20,10))
ax[0].scatter(x2, y2, c=label_mac_surface, cmap='RdBu', alpha=0.5)
ax[0].set_title('Ground Truth')
ax[0].grid()
ax[1].scatter(x2, y2, c=(x2>0), cmap='RdBu', alpha=0.5)
ax[1].set_title('Spectral Graph Theory')
ax[1].grid()
Error Rate
In [64]:
error_rate_2 = np.sum((x2>0) != label_mac_surface)/len(label_mac_surface)
print("The error rate of Spectral Graph Theory is {:.2%}".format(error_rate_2))
With the reviewer information, the error rate goes down. We try to explain this result as follows: the also_bought information is generated by Amazon from the behaviour of a large number of users, so it reflects the connection between two products only indirectly. The shared-reviewer information, on the other hand, directly reflects how often people buy (and review) both products. That is why adding the reviewer information gives a lower error rate.
In [65]:
from matplotlib.colors import ListedColormap
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import make_moons, make_circles, make_classification
from sklearn.neural_network import MLPClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import RBF
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
classifiers = [
KNeighborsClassifier(3),
SVC(kernel="linear", C=0.25),
SVC(gamma=2, C=1),
GaussianProcessClassifier(1.0 * RBF(1.0)),
DecisionTreeClassifier(max_depth=5),
RandomForestClassifier(max_depth=5, n_estimators=10, max_features=1),
MLPClassifier(alpha=1),
AdaBoostClassifier(),
GaussianNB(),
QuadraticDiscriminantAnalysis()]
names = ["Nearest Neighbors", "Linear SVM", "RBF SVM", "Gaussian Process",
"Decision Tree", "Random Forest", "Neural Net", "AdaBoost",
"Naive Bayes", "QDA"]
score = []
In [66]:
from sklearn.model_selection import train_test_split
X, y = np.array(M_mac_surface_2.todense()), np.array(label_mac_surface)
# 90% is training set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=1/10)
In [67]:
# iterate over classifiers
# without parameter tuning
for clf in classifiers:
    clf.fit(X_train, y_train)
    score.append(clf.score(X_test, y_test))
In [68]:
error = np.ones(np.shape(score))-score
plt.figure(figsize=(13,5))
plt.xlabel("Model")
plt.ylabel("Error Rate")
plt.title("Machine Learning Model Comparision")
plt.grid()
plt.xticks(range(len(error)), names, rotation=45)
plt.plot(range(len(error)), error, "b*-", linewidth=1)
plt.plot(range(len(error)), [error_rate_2]*len(error), "r--", linewidth=1.5);
We can now see that the machine learning models vary a lot on this problem, and we can clearly compare them with the spectral graph theory approach. As implemented above, the error rate of our spectral graph model is a little above 15%, which beats several machine learning models run without further parameter tuning.
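The models above all run with default hyper-parameters. A hedged sketch of how one of them could be tuned with cross-validation (the grid values are purely illustrative):
from sklearn.model_selection import GridSearchCV
# illustrative grid search for the RBF SVM; the grid is arbitrary
param_grid = {'C': [0.1, 1, 10], 'gamma': [0.1, 1, 2]}
grid = GridSearchCV(SVC(), param_grid, cv=5)
grid.fit(X_train, y_train)
print("Best parameters:", grid.best_params_)
print("Tuned test error: {:.2%}".format(1 - grid.best_estimator_.score(X_test, y_test)))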
The network graphs for the Macbook and Surface related products comparison can be viewed here: https://nbviewer.jupyter.org/gist/xiangzhemeng/deb2495e7f9af7953d69859216aee7b8
Load the data.
In [69]:
df_product = pd.read_pickle('data/product_also_bought.pkl')
df_review = pd.read_pickle('data/review_also_bought.pkl')
Find the also_bought products of a list of products.
In [70]:
def get_all_related_asin_from_list_asin(list_asin):
    res = []
    for asin in list_asin:
        res.extend(df_product.loc[asin]['also_bought'])
    res = list(set(res))
    return res
In [71]:
# Macbook
nb_layer = 2
df_mac = df_product[df_product.index == 'B005CWJB5G']
list_asin = ['B005CWJB5G']
for i in range(nb_layer):
    index = get_all_related_asin_from_list_asin(list_asin)
    df_mac = pd.concat([df_mac, df_product.loc[index]])
    list_asin = index
df_mac = df_mac.reset_index().drop_duplicates(subset='asin')
print('We have {} products in the group of mac.'.format(df_mac.shape[0]))
In [72]:
# Surface
nb_layer = 2
df_surface = df_product[df_product.index == 'B00FG7MVJU']
list_asin = ['B00FG7MVJU']
for i in range(nb_layer):
    index = get_all_related_asin_from_list_asin(list_asin)
    df_surface = pd.concat([df_surface, df_product.loc[index]])
    list_asin = index
df_surface = df_surface.reset_index().drop_duplicates(subset='asin')
print('We have {} products in the group of surface.'.format(df_surface.shape[0]))
In [73]:
# ThinkPad
nb_layer = 2
df_think = df_product[df_product.index == 'B00JP7OGQ8']
list_asin = ['B00JP7OGQ8']
for i in range(nb_layer):
    index = get_all_related_asin_from_list_asin(list_asin)
    df_think = pd.concat([df_think, df_product.loc[index]])
    list_asin = index
df_think = df_think.reset_index().drop_duplicates(subset='asin')
print('We have {} products in the group of ThinkPad.'.format(df_think.shape[0]))
In [74]:
# FORMER CODE
# Find the duplicated products among three groups.
dupasin_mac_surface = list(set(df_mac.asin.values) & set(df_surface.asin.values))
print('We have {} duplicate products in Mac and Surface.'.format(len(dupasin_mac_surface)))
dupasin_mac_think = list(set(df_mac.asin.values) & set(df_think.asin.values))
print('We have {} duplicate products in Mac and ThinkPad.'.format(len(dupasin_mac_think)))
dupasin_think_surface = list(set(df_think.asin.values) & set(df_surface.asin.values))
print('We have {} duplicate products in ThinkPad and Surface.'.format(len(dupasin_think_surface)))
dupasin = list(set(df_think.asin.values) & set(df_surface.asin.values) & set(df_mac.asin.values))
print('We have {} duplicate products.'.format(len(dupasin)))
In [75]:
df_think_surface = pd.concat([df_think.reset_index(), df_surface])
df_think_mac = pd.concat([df_think.reset_index(), df_mac])
df_mac_surface = pd.concat([df_mac.reset_index(), df_surface])
# distinct mac products
mac_duplicate = list(set(df_mac.asin.values) & set(df_think_surface.asin.values))
print('We have {} distinct products for Macbook.'.format(len(df_mac)-len(mac_duplicate)))
# distinct surface products
surface_duplicate = list(set(df_surface.asin.values) & set(df_think_mac.asin.values))
print('We have {} distinct products for Surface.'.format(len(df_surface)-len(surface_duplicate)))
# distinct ThinkPad products
think_duplicate = list(set(df_think.asin.values) & set(df_mac_surface.asin.values))
print('We have {} distinct products for Thinkpad.'.format(len(df_think)-len(think_duplicate)))
In [76]:
import matplotlib.pyplot as plt
from matplotlib_venn import venn3
v=venn3(subsets = (541, 304, 54, 365, 68, 61, 80), set_labels = ('Mac', 'Surface', 'ThinkPad'))
plt.show()
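The subset sizes above are hard-coded from the counts printed earlier; as a hedged alternative, matplotlib_venn can derive the region sizes directly from the three asin sets:
from matplotlib_venn import venn3
# let venn3 compute the region sizes from the sets themselves
set_mac = set(df_mac.asin.values)
set_surface = set(df_surface.asin.values)
set_think = set(df_think.asin.values)
venn3([set_mac, set_surface, set_think], set_labels=('Mac', 'Surface', 'ThinkPad'))
plt.show()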
We only keep products distinct to one of the groups.
In [77]:
df_think_new = df_think[~df_think['asin'].isin(think_duplicate)]
df_mac_new = df_mac[~df_mac['asin'].isin(mac_duplicate)]
df_surface_new = df_surface[~df_surface['asin'].isin(surface_duplicate)]
In [78]:
df_all_new = pd.concat([df_think_new.reset_index(), df_mac_new, df_surface_new])
print('We have {} products in all.'.format(df_all_new.shape[0]))
In [79]:
df_all_new.head()
Out[79]:
Same as before, we use the data selected above (in df_all_new) to build the product graph.
Weight of edge: we implement two ways to calculate the weight of an edge.
Use only the information of 'also_bought': $$Weight = \# Shared\_Neighbors$$
Use both the information of 'also_bought' and 'reviewer': $$Weight = 0.2 * \frac{\# Shared\_Neighbors}{Average\_Shared\_Neighbors} + 0.8 * \frac{\# Shared\_Reviewers}{Average\_Shared\_Reviewers}$$ where each $Average$ is the mean over the non-zero elements, not over all elements (see the toy example below).
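As a small aside on why the averages are taken over non-zero entries only: most product pairs share nothing, so including the zeros would shrink the average and inflate the normalized weights. A toy example:
# toy example: mean over non-zero entries vs. mean over all entries
toy = np.array([0, 0, 0, 2, 4, 6])
print(np.mean(toy))           # 2.0, dragged down by the zeros
print(np.mean(toy[toy > 0]))  # 4.0, the average actually used for normalization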
In [80]:
G = nx.Graph()
G.add_nodes_from(df_all_new.asin.values)
In [81]:
for i in range(len(df_all_new)):
    for j in range(i+1, len(df_all_new)):
        set_len = len(set(df_all_new.iloc[i]['also_bought']) & set(df_all_new.iloc[j]['also_bought']))
        G.add_edge(df_all_new.iloc[i]['asin'], df_all_new.iloc[j]['asin'], weight=set_len)
In [82]:
A = nx.adjacency_matrix(G, weight='weight')
A.todense()
Out[82]:
Calculate the similarity matrix by normalizing so that the maximum weight is 1.
In [300]:
A.max()
Out[300]:
In [301]:
similar = A.todense()/A.max()
np.fill_diagonal(similar,0)
Keep the TOP K Neighbors.
We first use K = 300.
In [302]:
weights = similar.getA().copy()
NEIGHBORS = 300
# Keep only the K strongest edges
# By making weights of all other edges equal to 0
ind = np.argsort(weights)[:,::-1][:,NEIGHBORS:]
for i in range(weights.shape[1]):
    weights[i][ind[i]] = 0
# Make it symmetric
bigger = weights.transpose() > weights
weights = weights - weights*bigger + weights.transpose()*bigger
In [303]:
weights
Out[303]:
Calculate the Laplacian.
In [304]:
degrees = np.sum(weights, axis=0) + 0.0001  # small offset to avoid dividing by zero for isolated nodes
D = np.diag(degrees)
D_0_5 = np.diag(1./np.sqrt(degrees))
laplacian = np.dot(D_0_5, np.dot((D - weights), D_0_5))
In [305]:
plt.spy(laplacian)
plt.show()
Eigenvalue decomposition of the Laplacian.
In [306]:
from scipy import sparse, stats, spatial
laplacian = sparse.csr_matrix(laplacian)
eigenvalues, eigenvectors = sparse.linalg.eigsh(laplacian, k=10, which='SM')
plt.plot(eigenvalues, '.-', markersize=15);
Ground truth: following the concatenation order, we label all data in the ThinkPad group as 0, the Mac group as 1 and the Surface group as 2.
In [307]:
labels = np.ones(A.shape[0])
labels[0:len(df_think_new)] = 0
labels[len(df_think_new):len(df_think_new)+len(df_mac_new)] = 1
labels[len(df_think_new)+len(df_mac_new):len(df_think_new)+len(df_mac_new)+len(df_surface_new)] = 2
Plot the ground truth and the prediction based on spectral graph theory.
In [310]:
x1 = eigenvectors[:,1]
y1 = eigenvectors[:,2]
colorList = list()
for i in x1:
    if i < -.02:
        colorList.append(2)
    elif i > 0:
        colorList.append(1)
    else:
        colorList.append(0)
fig, ax = plt.subplots(2,1,figsize=(20,10))
ax[0].scatter(x1, y1, c=labels, alpha=0.5)
ax[0].set_title('Ground Truth')
ax[0].grid()
ax[1].scatter(x1, y1, c=colorList, alpha=0.5)
ax[1].set_title('Spectral Graph Theory')
ax[1].grid()
Error rate
In [311]:
error_rate = np.sum(colorList != labels)/len(labels)
print("The error rate of Spectral Graph Theory is {:.2%}".format(error_rate))
In [281]:
# Ground truth.
x = np.ones(len(G.nodes))
x[0:len(df_think_new)] = -1
x[len(df_think_new):len(df_think_new)+len(df_mac_new)] = 1
x[len(df_think_new)+len(df_mac_new):len(df_think_new)+len(df_mac_new)+len(df_surface_new)] = 3
# x[0:n_mac] = -1
def prepare_observations(p):
    """Prepare observations, where p is the percentage of values to keep."""
    rs = np.random.RandomState(42)
    M = np.diag(rs.uniform(size=len(G.nodes)) < p)
    return M.dot(x)
# Play with the percentage of observed values.
y = prepare_observations(p=0.1)
In [282]:
def solve(y, alpha):
    """
    Solve the optimization problem.
    Parameters:
        y: the observations
        alpha: the balance between fidelity and the smoothness prior.
    Returns:
        x_pred: the predicted class
        x_star: the solution of the optimization problem
    """
    M = np.diag(y != 0)
    x_star = np.linalg.solve(M + alpha*laplacian, M.dot(y))
    x_pred = np.ones(len(G.nodes))
    for i in range(len(x_star)):
        if x_star[i] < 0:
            x_pred[i] = -1
        elif x_star[i] > 0 and x_star[i] < 2:
            x_pred[i] = 1
        else:
            x_pred[i] = 3
    return x_pred, x_star
x_pred, x_star = solve(y, alpha=1)
# Error rate.
err = np.sum(x != x_pred)
print('{} errors ({:.2%})'.format(err, err/len(G.nodes)))
In [283]:
alpha = 0.1
for p in [0.1, 0.3, 0.5, 0.7, 0.9]:
    y = prepare_observations(p)
    x_pred, _ = solve(y, alpha)
    err = np.sum(x != x_pred)
    print('{}: {:6.2%}'.format(p, err/len(G.nodes)))
Plot
In [284]:
x1 = eigenvectors[:,1]
y1 = eigenvectors[:,2]
fig, ax = plt.subplots(2,1,figsize=(20,10))
ax[0].scatter(x1, y1, c=labels, alpha=0.5)
ax[0].set_title('Ground Truth')
ax[0].grid()
ax[1].scatter(x1, y1, c=x_pred, alpha=0.5)
ax[1].set_title('Transductive Learning')
ax[1].grid()
In [218]:
df_review.head()
Out[218]:
Select the review information for the products in our DataFrame.
In [219]:
df_review_chosen = df_review.loc[df_all_new.asin].reset_index()
Build the graph as we did above.
In [220]:
GB = nx.Graph()
GB.add_nodes_from(df_all_new.asin.values)
b_len_list = [] # list for # shared also_bought
r_len_list = [] # list for # shared reviewer
for i in range(len(df_all_new)):
    for j in range(i+1, len(df_all_new)):
        b_len_list.append(len(set(df_all_new.iloc[i]['also_bought']) & set(df_all_new.iloc[j]['also_bought'])))
        r_len_list.append(len(set(df_review_chosen.iloc[i]['reviewerID']) & set(df_review_chosen.iloc[j]['reviewerID'])))
In [221]:
# Study the basics of two lists
df_b_r = pd.DataFrame({'b': b_len_list, 'r': r_len_list})
pd.DataFrame([df_b_r[df_b_r.b>0].describe().b, df_b_r[df_b_r.r>0].describe().r])
Out[221]:
Just as above, we get two skewed distributions here.
In [222]:
b_len_list.reverse()
r_len_list.reverse()
mean_b = np.mean(df_b_r[df_b_r.b>0].b)
mean_r = np.mean(df_b_r[df_b_r.r>0].r)
for i in range(len(df_all_new)):
    for j in range(i+1, len(df_all_new)):
        b_len = b_len_list.pop()/mean_b
        r_len = r_len_list.pop()/mean_r
        GB.add_edge(df_all_new.iloc[i]['asin'], df_all_new.iloc[j]['asin'], weight=0.2*b_len+0.8*r_len)
In [223]:
B = nx.adjacency_matrix(GB, weight='weight')
B.todense()
Out[223]:
Similarity Matrix
In [224]:
B.max()
Out[224]:
In [225]:
similarB = B.todense()/B.max()
np.fill_diagonal(similarB, 0)
TOP K Neighbors
In [285]:
weights = similarB.getA().copy()
NEIGHBORS = 300
# Keep only the K strongest edges
# By making weights of all other edges equal to 0
ind = np.argsort(weights)[:,::-1][:,NEIGHBORS:]
for i in range(weights.shape[1]):
    weights[i][ind[i]] = 0
# Make it symmetric
bigger = weights.transpose() > weights
weights = weights - weights*bigger + weights.transpose()*bigger
Calculate Laplacian
In [286]:
degrees = np.sum(weights, axis=0)
D = np.diag(degrees)
D_0_5 = np.diag(1./np.sqrt(degrees))
laplacian = np.dot(D_0_5, np.dot((D - weights), D_0_5))
plt.spy(laplacian)
plt.show()
Eigenvalue Decomposition
In [287]:
from scipy import sparse, stats, spatial
laplacian = sparse.csr_matrix(laplacian)
eigenvalues, eigenvectors = sparse.linalg.eigsh(laplacian, k=10, which='SM')
plt.plot(eigenvalues, '.-', markersize=15);
Plot
In [288]:
labels = np.ones(B.shape[0])
labels[0:len(df_think_new)] = 0
labels[len(df_think_new):len(df_think_new)+len(df_mac_new)] = 1
labels[len(df_think_new)+len(df_mac_new):len(df_think_new)+len(df_mac_new)+len(df_surface_new)] = 2
In [295]:
x2 = eigenvectors[:,1]
y2 = eigenvectors[:,2]
colorList = list()
for i in x2:
    if i > 0.02:
        colorList.append(2)
    elif i < 0:
        colorList.append(1)
    else:
        colorList.append(0)
fig, ax = plt.subplots(2,1,figsize=(20,10))
ax[0].scatter(x2, y2, c=labels, alpha=0.5)
ax[0].set_title('Ground Truth')
ax[0].grid()
ax[1].scatter(x2, y2, c=colorList, alpha=0.5)
ax[1].set_title('Spectral Graph Theory')
ax[1].grid()
Error Rate
In [296]:
error_rate = np.sum(colorList != labels)/len(labels)
print("The error rate of Spectral Graph Theory is {:.2%}".format(error_rate))
However, in this case, the information of reviewers doesn't offer any help.
In [297]:
classifiers = [
KNeighborsClassifier(3),
SVC(kernel="linear", C=0.25),
SVC(gamma=2, C=1),
GaussianProcessClassifier(1.0 * RBF(1.0)),
DecisionTreeClassifier(max_depth=5),
RandomForestClassifier(max_depth=5, n_estimators=10, max_features=1),
MLPClassifier(alpha=1),
AdaBoostClassifier(),
GaussianNB(),
QuadraticDiscriminantAnalysis()]
names = ["Nearest Neighbors", "Linear SVM", "RBF SVM", "Gaussian Process",
"Decision Tree", "Random Forest", "Neural Net", "AdaBoost",
"Naive Bayes", "QDA"]
score = []
In [298]:
X, y = np.array(B.todense()), np.array(labels)
# 90% is training set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=1/10)
# iterate over classifiers
# without parameter tuning
for clf in classifiers:
    clf.fit(X_train, y_train)
    score.append(clf.score(X_test, y_test))
In [299]:
error = np.ones(np.shape(score))-score
plt.figure(figsize=(13,5))
plt.xlabel("Model")
plt.ylabel("Error Rate")
plt.title("Machine Learning Model Comparision")
plt.grid()
plt.xticks(range(len(error)), names, rotation=45)
plt.plot(range(len(error)), error, "b*-", linewidth=1)
# Use one red line to denote the error rate of spectral graph theory
plt.plot(range(len(error)), [error_rate]*len(error), "r--", linewidth=1.5);
Most of the untuned machine learning models perform quite well here; AdaBoost, Neural Net and Linear SVM are the top three models in this case.
As shown above, with two different definitions of the edge weight, we analyzed both the two-group graph and the three-group graph.
In the two-group graph, Spectral Graph Theory performs quite well and beats several "raw" machine learning models. The reviewer data helps a lot, which, in our opinion, is because the shared-reviewer information directly reflects the connection between two products, whereas the also_bought information combines the behaviour of many users (and possibly some learning algorithm on Amazon's side); it is very useful in practice but reflects the connection between products less directly.
In the three-group graph, Spectral Graph Theory also does a good job, and we can cluster the three groups with machine learning algorithms and with spectral graph theory with similar results. However, the error rate becomes larger than for the two-group graph. In our opinion, this may result from the way we generate the network: the dataset is not that big and we add two layers from manually chosen cores, so the connections between the groups can be strong and the clusters less obvious. How to label our data points correctly and meaningfully then becomes another problem.